fix(listing): pick exact or create new one on update #726

Merged
merged 1 commit into from
Dec 23, 2024

@shcheklein shcheklein commented Dec 20, 2024

Partially addresses and stabilizes #725

Fixes several cases in how we find and suggest an existing cached listing when update=True is passed.

1. There was a bug in the original code:

# choosing the smallest possible one to minimize update time
listing = sorted(listings, key=lambda ls: len(ls.name))[0]

This code was actually picking the "biggest" listing instead, because the shortest name belongs to the broadest prefix. It should have used reverse=True.

So, in a situation like:

file:///whatever/dir1 exists and cached
file:///whatever/ exists and cached

and we tried to update file:///whatever/dir1, it was updating file:///whatever/ instead.
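The first case can be reproduced in isolation. This is a minimal sketch, not the project's code: `Listing` here is a hypothetical stand-in that only carries a `name`, while the two URIs and the sort expression are taken from the description above.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    # Hypothetical stand-in for a cached listing record; only `name` matters here.
    name: str


listings = [Listing("file:///whatever/dir1"), Listing("file:///whatever/")]

# Original selection: ascending sort by name length, then take the first item.
# The shortest name is the broadest prefix, so this picks the parent listing.
picked_original = sorted(listings, key=lambda ls: len(ls.name))[0]

# With reverse=True the longest (most specific) name comes first.
picked_reversed = sorted(listings, key=lambda ls: len(ls.name), reverse=True)[0]

print(picked_original.name)  # file:///whatever/
print(picked_reversed.name)  # file:///whatever/dir1
```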

2. Then, in a situation like:

file:///whatever/ exists and cached
file:///whatever/dir1 doesn't yet exist

and we did update=True for file:///whatever/dir1, it was updating the whole file:///whatever/, which can be arbitrarily large. We don't want to do this.

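So, the bottom line: on update, we either pick the listing that exactly matches the requested path or create a new one.

Generalizing the code a bit to cover both of those cases led to the code in this PR.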
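The rule named in the PR title ("pick exact or create new one on update") can be sketched as below. This is an illustration under assumptions, not the merged implementation: `pick_listing_for_update` is a hypothetical helper, listings are represented by plain URI strings, and ignoring a trailing slash when comparing is an assumption made for the example.

```python
from typing import Optional


def pick_listing_for_update(requested_uri: str, cached_uris: list[str]) -> Optional[str]:
    """Hypothetical helper: return the cached listing to refresh, or None.

    None signals that no exact match exists and a new listing should be
    created, instead of refreshing a broader parent listing.
    """
    # Assumption for this sketch: a trailing slash does not change identity.
    normalized = requested_uri.rstrip("/")
    for uri in cached_uris:
        if uri.rstrip("/") == normalized:
            return uri
    return None


# Case 1: both the parent and dir1 are cached, so the exact match is chosen.
case_1 = pick_listing_for_update(
    "file:///whatever/dir1", ["file:///whatever/", "file:///whatever/dir1"]
)

# Case 2: only the parent is cached, so nothing is reused.
case_2 = pick_listing_for_update("file:///whatever/dir1", ["file:///whatever/"])
```

In both cases the parent listing file:///whatever/ is left alone, which is the behavior the two situations above ask for.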

@shcheklein shcheklein added the bug Something isn't working label Dec 20, 2024
@shcheklein shcheklein self-assigned this Dec 20, 2024

cloudflare-workers-and-pages bot commented Dec 20, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 3bcaa78
Status: ✅  Deploy successful!
Preview URL: https://94f10ba5.datachain-documentation.pages.dev
Branch Preview URL: https://fix-listing-selection.datachain-documentation.pages.dev


codecov bot commented Dec 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.43%. Comparing base (20c73b2) to head (3bcaa78).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #726      +/-   ##
==========================================
- Coverage   87.43%   87.43%   -0.01%     
==========================================
  Files         114      114              
  Lines       10967    10965       -2     
  Branches     1508     1507       -1     
==========================================
- Hits         9589     9587       -2     
  Misses        998      998              
  Partials      380      380              
| Flag | Coverage Δ |
| --- | --- |
| datachain | 87.36% <100.00%> (-0.01%) ⬇️ |


src/datachain/lib/dc.py (outdated)
if listings and not update:
    listing = sorted(listings, key=lambda ls: ls.created_at)[-1]

# For local file system we need to fix listing path / prefix

@ilongin Q: why do we need to do this only for the local file system?

@shcheklein shcheklein changed the title from "fix(listing): pick actually the smallest one to update" to "fix(listing): pick exact or create new one on update" Dec 21, 2024
@shcheklein shcheklein requested a review from ilongin December 21, 2024 20:08
@shcheklein shcheklein marked this pull request as ready for review December 21, 2024 20:08
@shcheklein shcheklein requested a review from a team December 21, 2024 20:08
@@ -38,20 +39,43 @@ def _tree_to_entries(tree: dict, path=""):
 @pytest.fixture
 def listing(test_session):
     catalog = test_session.catalog
-    dataset_name, _, _, _ = DataChain.parse_uri("s3://whatever", test_session)
+    dataset_name, _, _, _ = DataChain.parse_uri("file:///whatever", test_session)

[C]: minor semi-related change - there is no reason to use s3 here; the tests were failing because they were hitting an actual token update without proper fixtures. cc @ilongin

@dreadatour dreadatour left a comment

👍

@shcheklein shcheklein merged commit 848f8a2 into main Dec 23, 2024
34 checks passed
@shcheklein shcheklein deleted the fix-listing-selection branch December 23, 2024 16:33
Labels
bug Something isn't working
4 participants